Controlling the Chunk-Size in Deduplication Systems

Authors

  • Michael Hirsch
  • Shmuel Tomi Klein
  • Dana Shapira
  • Yair Toaff
Abstract

A special case of data compression in which repeated chunks of data are stored only once is known as deduplication. The input data is cut into chunks and a cryptographically strong hash value of each (different) chunk is stored. To restrict the influence of small inserts and deletes to local perturbations, the chunk boundaries are usually defined in a data dependent way, which implies that the chunks are of variable length. Usually, the chunk sizes may spread over a large range, which may have a negative impact on the storage performance. This may be dealt with by imposing artificial lower and upper bounds. This paper suggests an alternative by which the chunk size distribution is controlled in a natural way. Some analytical and experimental results are given.
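As a concrete illustration of the mechanism described in the abstract, the sketch below shows content-defined (variable-length) chunking with a rolling hash, artificial lower and upper bounds on the chunk size, and a cryptographically strong hash per distinct chunk. The window size, boundary mask, bounds, and hash choices are illustrative assumptions for the example only, not the parameters or the method proposed in the paper.

    import hashlib

    # Illustrative parameters -- not the values used in the paper.
    MIN_CHUNK = 2 * 1024          # artificial lower bound on the chunk size
    MAX_CHUNK = 64 * 1024         # artificial upper bound on the chunk size
    MASK = (1 << 13) - 1          # cut when hash & MASK == 0 (about 8 KiB on average)
    WINDOW = 48                   # bytes covered by the rolling hash
    PRIME, MOD = 257, 1 << 32
    POW_OUT = pow(PRIME, WINDOW - 1, MOD)   # weight of the byte leaving the window

    def chunk_boundaries(data: bytes):
        """Yield (start, end) offsets of data-dependent, variable-length chunks."""
        start, h = 0, 0
        for i, b in enumerate(data):
            if i - start >= WINDOW:                      # slide the rolling-hash window
                h = (h - data[i - WINDOW] * POW_OUT) % MOD
            h = (h * PRIME + b) % MOD                    # bring in the new byte
            size = i - start + 1
            if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
                yield start, i + 1                       # data-dependent (or forced) cut
                start, h = i + 1, 0
        if start < len(data):
            yield start, len(data)                       # last, possibly short, chunk

    def dedup(data: bytes):
        """Store each distinct chunk once, keyed by a cryptographically strong hash."""
        store, recipe = {}, []
        for s, e in chunk_boundaries(data):
            fp = hashlib.sha256(data[s:e]).hexdigest()
            store.setdefault(fp, data[s:e])              # repeated chunks are stored only once
            recipe.append(fp)                            # the recipe reconstructs the input
        return store, recipe

Because the cut points depend only on the local window contents, a small insert or delete shifts the boundaries of at most a few nearby chunks; MIN_CHUNK and MAX_CHUNK play the role of the artificial bounds that the paper's approach seeks to replace.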

Similar articles

Improving restore speed for backup systems that use inline chunk-based deduplication

Slow restoration due to chunk fragmentation is a serious problem facing inline chunk-based data deduplication systems: restore speeds for the most recent backup can drop orders of magnitude over the lifetime of a system. We study three techniques—increasing cache size, container capping, and using a forward assembly area—for alleviating this problem. Container capping is an ingest-time operati...
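The forward assembly area mentioned above can be illustrated with a short sketch: restore one span of the backup by reading each container that holds a needed chunk only once and copying its chunks directly into their final positions. The data structures and function names below are assumptions made for the example, not that paper's implementation.

    def restore_span(recipe, chunk_to_container, read_container):
        """recipe: list of (fingerprint, length) pairs in file order (one assembly span).
        chunk_to_container: maps a fingerprint to the id of the container holding it.
        read_container: reads a container, returning {fingerprint: chunk bytes}.
        Returns the restored bytes for this span."""
        # 1. Plan: compute each chunk's offset and group the slots by container.
        slots_by_container, off = {}, 0
        for fp, length in recipe:
            slots_by_container.setdefault(chunk_to_container[fp], []).append((off, fp, length))
            off += length
        buf = bytearray(off)                    # the forward assembly area

        # 2. Fill: read every needed container once, copy its chunks into place.
        for cid, slots in slots_by_container.items():
            chunks = read_container(cid)        # one sequential container read
            for slot_off, fp, length in slots:
                buf[slot_off:slot_off + length] = chunks[fp][:length]
        return bytes(buf)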

The assignment of chunk size according to the target data characteristics in deduplication backup system

This paper focuses on the trade-off between the deduplication rate and the processing penalty in a backup system that uses a conventional variable chunking method. The trade-off is a nonlinear negative correlation if the chunk size is fixed. To analyze the trade-off quantitatively across all the factors, a simulation approach is taken, which clarifies several correlations among chunk siz...
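The trade-off can be made concrete with a back-of-the-envelope calculation: halving the average chunk size roughly doubles the number of chunks, and therefore the index metadata and hashing work, while typically improving the deduplication rate. The dataset size and per-chunk metadata cost below are invented purely for illustration.

    # Illustrative numbers only -- not measurements from any of the papers.
    STORED = 1 << 40                   # 1 TiB of unique post-dedup data (assumed)
    META_PER_CHUNK = 64                # bytes of index entry per chunk (assumed)

    for avg_chunk in (4 << 10, 8 << 10, 16 << 10, 64 << 10):
        n_chunks = STORED // avg_chunk
        meta = n_chunks * META_PER_CHUNK
        print(f"avg chunk {avg_chunk >> 10:3d} KiB -> "
              f"{n_chunks:,} chunks, {meta / (1 << 30):.1f} GiB of index metadata")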

An Efficient Data Deduplication based on Tar-format Awareness in Backup Applications

Disk-based backup storage systems are widely used, and data deduplication is becoming an essential technique in such systems because of its space efficiency. Usually, several of a user's files are aggregated into a single Tar file on primary storage, and the Tar file is periodically transferred to and stored in the backup storage system (e.g., as a weekly full backup) [1]. In this paper, we ...
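The idea of Tar-format awareness can be sketched as follows: split the archive at member boundaries so that each contained file is hashed (or chunked) on its own, rather than letting the 512-byte Tar headers shift content between backups. The sketch uses Python's standard tarfile module and only illustrates the idea; it is not the system proposed in that paper.

    import hashlib, tarfile

    def member_fingerprints(tar_path):
        """Yield (member name, SHA-256 of member contents) for a tar archive."""
        with tarfile.open(tar_path, "r:*") as tar:
            for member in tar:
                if not member.isfile():
                    continue                     # skip directories, links, etc.
                f = tar.extractfile(member)
                h = hashlib.sha256()
                for block in iter(lambda: f.read(1 << 16), b""):
                    h.update(block)              # hash only the member's payload
                yield member.name, h.hexdigest()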

Primary Data Deduplication - Large Scale Study and System Design

We present a large scale study of primary data deduplication and use the findings to drive the design of a new primary data deduplication system implemented in the Windows Server 2012 operating system. File data was analyzed from 15 globally distributed file servers hosting data for over 2000 users in a large multinational corporation. The findings are used to arrive at a chunking and compressi...

ALACC: Accelerating Restore Performance of Data Deduplication Systems Using Adaptive Look-Ahead Window Assisted Chunk Caching

Data deduplication has been widely applied in storage systems to improve the efficiency of space utilization. In data deduplication systems, restore performance is seriously hindered by read amplification, since the accessed data chunks are scattered over many containers. A container, consisting of hundreds or thousands of data chunks, is the data unit to be read from or written to the storage...
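A look-ahead window assisted chunk cache can be sketched roughly as follows: whenever a container is read, keep only those of its chunks that the upcoming portion of the restore recipe will actually reference. The structures and the fixed look-ahead policy below are illustrative assumptions and do not reproduce ALACC's adaptive algorithm.

    def restore(recipe, chunk_to_container, read_container, lookahead=1000):
        """recipe: list of chunk fingerprints in file order (illustrative layout)."""
        cache, out = {}, []                      # fingerprint -> chunk bytes
        for i, fp in enumerate(recipe):
            if fp not in cache:
                cid = chunk_to_container[fp]
                container = read_container(cid)  # read amplification happens here
                window = set(recipe[i:i + lookahead])
                # cache only the chunks that the look-ahead window will reference
                for cfp, data in container.items():
                    if cfp in window:
                        cache[cfp] = data
            out.append(cache[fp])
        return b"".join(out)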

Journal title:

Volume:   Issue:

Pages:

Publication date: 2015